Research on Model of Network Information Extraction Based on Improved Topic-focused Web Crawler Key Technology
Authors
Abstract
Original scientific paper. With the arrival of the big data era, characterized by semi-structured and unstructured text, the accurate extraction of network information has attracted wide attention from researchers. This paper proposes a model of network information extraction based on improved topic-focused web crawler key technology, taking Web news as the object of extraction. The authors describe in detail the main functions, methods, and technologies used or implemented on each layer of the model, and focus on how to extract topic-oriented network information efficiently from a large number of Web news instances, in order to explore a research method for network information extraction. The experimental results show the feasibility, validity, and superiority of the model design. The model plays an important role in constructing a topic-focused Web news corpus, which provides a real-time data source for trust analysis, currency analysis, hot topic detection, and topic evolution tracking of Web news.
Similar resources
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, downloading domain-specific web pages is not a simple task. This unfocused approach often shows undesired results. Therefore, several new ideas have been proposed; a key technique among them is focused crawling, which is able to crawl particular topical...
A Focused Crawler Based on Correlation Analysis
With the rapid development of network and information technology, huge amounts of data are available on the internet. A major problem faced by most researchers, however, is how to effectively filter information on a particular subject or field out of these data. In this paper, we try to build a focused crawler based on the vector space model and TF-IDF text correlation analysis. We...
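As a rough illustration of the vector space model and TF-IDF correlation analysis named in this abstract, the sketch below scores candidate page texts against a topic description by cosine similarity. It is not the authors' implementation: the function names (`tokenize`, `tf_idf_vectors`, `cosine`, `relevance`), the whitespace tokenizer, and the smoothed IDF formula are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase, split on whitespace, keep alphabetic tokens.
    return [w for w in text.lower().split() if w.isalpha()]


def tf_idf_vectors(docs):
    # Build one sparse TF-IDF vector (dict: term -> weight) per document.
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF (an assumption; other variants are common).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        total = len(toks) or 1
        vectors.append({t: (cnt / total) * idf[t] for t, cnt in tf.items()})
    return vectors


def cosine(a, b):
    # Cosine similarity between two sparse vectors.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def relevance(topic, pages):
    # Score each page against the topic description; higher means more relevant.
    vecs = tf_idf_vectors([topic] + pages)
    return [cosine(vecs[0], v) for v in vecs[1:]]


if __name__ == "__main__":
    topic = "focused web crawler topic relevance"
    pages = [
        "a focused crawler ranks web pages by topic relevance",
        "recipe for chocolate cake with butter and sugar",
    ]
    for page, score in zip(pages, relevance(topic, pages)):
        print(f"{score:.3f}  {page}")
```

A focused crawler could compare such a score against a threshold to decide whether to keep a page or follow its links; the threshold itself would have to be tuned on real data.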
Design of Improved Web Crawler By Analysing Irrelevant Result
A key issue in designing a focused Web crawler is how to determine whether an unvisited URL is relevant to the search topic. Effective relevance prediction can help avoid downloading and visiting many irrelevant pages. In this module, we propose a new learning-based approach to improve relevance prediction in focused Web crawlers. For this study, we chose Naïve Bayesian as the base prediction m...
Hybrid focused crawling on the Surface and the Dark Web
Focused crawlers enable the automatic discovery of Web resources about a given topic by automatically navigating through the Web link structure and selecting the hyperlinks to follow by estimating their relevance to the topic of interest. This work proposes a generic focused crawling framework for discovering resources on any given topic that reside on the Surface or the Dark Web. The proposed ...
A Survey on Semantic Focused Web Crawler for Information Discovery Using Data Mining Technique
Data mining is the process of extracting hidden predictive information from huge databases. It is a new technology with great potential to help companies focus on the most important information in their data warehouses. Web mining is a data mining technique that automatically discovers information from web documents. The amount of data and its dynamicity make it impossible to crawl the Wo...